{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "584ea112ddc4d14e",
   "metadata": {},
   "source": [
    "# 第36章 多臂老虎机"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65b08216",
   "metadata": {},
   "source": [
    "## 习题36.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e93fc9c",
   "metadata": {},
   "source": [
    "&emsp;&emsp;探索优先算法中首先对每个手臂$a$进行$N$次选择，得到估计的期望奖励$\\hat{Q}(a)$，然后一直选择估计的期望奖励最大的手臂，总轮数是$T$。证明以下概率不等式成立。\n",
    "$$\n",
    "P\\left[ \\hat{Q}(a) \\leqslant Q(a) + \\sqrt{\\frac{2 \\log T}{N}} \\right] \\geqslant 1 - \\frac{1}{T^4}\n",
    "$$\n",
    "其中，$Q(a)$是手臂$a$真实的期望奖励。提示：使用Hoeffding不等式。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a24401c0",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a26f336d",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3abf0c86",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "739a6d06",
   "metadata": {},
   "source": [
    "## 习题36.2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1297c611",
   "metadata": {},
   "source": [
    "&emsp;&emsp;假设多臂老虎机有三个手臂，奖励取值都是0或1，分布都是伯努利分布，期望分别是$Q(a_1) = 0.4,Q(a_2)=0.5,Q(a_3)=0.6$。与老虎机交互的总轮数是$T=10000$。通过模拟计算探索优先算法（$N=100$）、$\\varepsilon$贪心算法（$\\varepsilon = 0.2$）、UCB算法、汤普森采样算法在这个多臂老虎机的总体奖励。各自进行10次模拟后取平均。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62e1bb10",
   "metadata": {},
   "source": [
    "&emsp;&emsp;UCB算法使用更紧的上界\n",
    "$$\n",
    "\\frac{n(a)}{N(a)} + \\sqrt{\\frac{\\frac{n(a)}{N(a)} \\log t}{N(a)}} + \\frac{\\log t}{N(a)}\n",
    "$$\n",
    "其中，$t$是轮数，$N(a)$是手臂$a$被选择的次数，是$n(a)$中选择手臂结果为1的次数。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10466ef6",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05a83efb",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e92fe05",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2ac97c0",
   "metadata": {},
   "source": [
    "## 习题36.3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22cae8ee",
   "metadata": {},
   "source": [
    "&emsp;&emsp;$\\varepsilon$贪心算法中的探索概率$\\varepsilon$可以不是一个定量，而是一个随着轮数$t$递减的变量$\\varepsilon_t$。设计一个$\\varepsilon_t$的函数，并观察这个$\\varepsilon$贪心算法在例题36.2中的效果。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59980dc3",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a76104a",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19b9a258",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "659e7f89",
   "metadata": {},
   "source": [
    "## 习题36.4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78e7497d",
   "metadata": {},
   "source": [
    "&emsp;&emsp;解析为什么汤普森采样算法中$s$值越小越趋向探索，而$s$值越大越趋向利用。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "903f07ed",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35e702cf",
   "metadata": {},
   "source": [
    "**解答思路：**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e0b032b",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fb03f6d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
